How to implement Bayesian Optimization in Python
Author :: Kevin Vecmanis
In this post I do a complete walk-through of implementing Bayesian hyperparameter optimization in Python. This method of hyperparameter optimization is extremely fast and effective compared to other “dumb” methods like GridSearchCV and RandomizedSearchCV.
In this article you will learn:
- What Bayesian Optimization is.
- Why you need to know it.
- How to use the hyperopt library - an implementation of this method in Python.
- How to structure your objective functions.
- How to save the Trials() object and load it later.
- How to implement it with the popular XGBoost classification algorithm.
- How to plot the Hyperopt search pattern.
Table of Contents
- Introduction: Taking people out of the loop
- What is Hyperopt
- Setting up GridSearch and RandomizedSearch
- Setting up Hyperopt for intelligent search
- The objective function
- The search space
- The fmin function
- Saving the Trials() object
- Using Hyperopt to tune XGBoost
- Visualizing Hyperopt’s Search Pattern
Introduction: Taking People Out of the Loop
If you have ever done a parameter search using GridSearchCV or RandomizedSearchCV, you understand how quickly the time requirements for these searches can explode when you want to do a comprehensive search of the solution space. Bayesian Optimization is an amazing solution to this problem, and offers a more ‘intelligent’ search strategy.
Bayesian Optimization works by building a probability-based model sequentially and adjusting that model after each iteration. There is a lot of research available on this optimization method, but in this post we’re going to focus on the practical implementation in Python.
You can read a paper on Bayesian Optimization here: Link to Bayesian Optimization paper
Bayesian Optimization is a must-have tool in a data scientist’s toolkit, simply because it dramatically outperforms other methods of parameter search.
Throughout the rest of the article we’re going to introduce the Hyperopt library - a fantastic implementation of Bayesian Optimization in Python - and use it to compare algorithm performance against grid search and randomized search.
Hyperopt
Hyperopt is a Python implementation of Bayesian Optimization. Throughout this article we’re going to use it as our implementation tool for executing these methods. I highly recommend this library!
Hyperopt requires a few pieces of input in order to function:
- An objective function
- A parameter search space
- The hyperopt minimization function
I’m going to walk through how to build each of these, but first let’s assemble our toy dataset.
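Here is a minimal sketch of one way to build such a dataset, assuming scikit-learn’s make_classification; the sample and feature counts below are illustrative assumptions rather than the original post’s exact values:

```python
# Build a toy classification dataset with scikit-learn.
# The sizes here are illustrative assumptions.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,     # number of rows
    n_features=20,      # total number of features
    n_informative=12,   # features that carry real signal
    n_redundant=4,      # linear combinations of informative features
    random_state=42,
)
```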
The next thing we’re going to do is set up an implementation of GridSearchCV and RandomizedSearchCV so that we can compare their performance on this dataset to Hyperopt.
Note: To use hyperopt you’ll need to open a terminal and run:
$ pip install hyperopt
Setting up GridSearch and RandomizedSearch
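Below is a sketch of what this comparison can look like. The SVC parameter grid, timing code, and search sizes are assumptions chosen to mirror the reported output rather than the original script:

```python
import time

import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Candidate SVC hyperparameters (illustrative; sized so both searches
# evaluate roughly 200 candidates).
param_grid = {
    'C': np.arange(0.005, 1.0, 0.04),
    'kernel': ['linear', 'rbf', 'poly'],
    'degree': [2, 3, 4],
    'probability': [True],   # needed so neg_log_loss can use predict_proba
}

# Randomized search: samples 200 candidates from the grid.
start = time.time()
rand_search = RandomizedSearchCV(SVC(), param_distributions=param_grid,
                                 n_iter=200, scoring='neg_log_loss',
                                 cv=3, random_state=42)
rand_search.fit(X, y)
print(f"RandomizedSearchCV took {time.time() - start:.2f} seconds for 200 candidates")
print(rand_search.best_score_, rand_search.best_params_)

# Grid search: evaluates every combination in the grid.
start = time.time()
grid_search = GridSearchCV(SVC(), param_grid=param_grid,
                           scoring='neg_log_loss', cv=3)
grid_search.fit(X, y)
print(f"GridSearchCV took {time.time() - start:.2f} seconds")
print(grid_search.best_score_, grid_search.best_params_)
```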
RandomizedSearchCV took 5.57 seconds for 200 candidates
-0.3025122527046663 {'probability': True, 'kernel': 'rbf', 'degree': 4, 'C': 0.85}
GridSearchCV took 43.80 seconds
-0.298021975049231 {'C': 0.975, 'degree': 4, 'kernel': 'rbf', 'probability': True}
Setting up Hyperopt for Intelligent Search
For Hyperopt, we have to define a function that the Tree-structured Parzen Estimator (TPE) algorithm will seek to minimize, as well as a new search space that’s in the appropriate format for the hyperopt algorithm. Our GridSearchCV and RandomizedSearchCV runs defaulted to 3-fold cross-validation, so we will replicate that in our objective function.
Because fmin naturally minimizes the score returned by the objective function, we’ll multiply our cross_val_score result by -1 to make it positive. Take caution to assess this on a case-by-case basis. Here we’re using neg_log_loss as the scoring function, which returns a negative number whose absolute value is the log loss. Lower log loss is better, so we multiply this score by -1 to turn it into a positive number that fmin can drive toward zero. If we didn’t, Hyperopt would seek to make the neg_log_loss value more and more negative, which would actually increase the log loss!
We define one new function: an objective function whose output we seek to minimize.
The Objective Function
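Here is a minimal sketch of such an objective function, assuming the toy dataset X and y from earlier and the 3-fold cross-validation with neg_log_loss discussed above:

```python
from hyperopt import STATUS_OK
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(params):
    """Score one set of SVC hyperparameters; fmin minimizes the returned loss."""
    clf = SVC(**params)
    # cross_val_score with neg_log_loss returns negative values, so flip the
    # sign to get a positive log loss that fmin can drive toward zero.
    loss = -cross_val_score(clf, X, y, scoring='neg_log_loss', cv=3).mean()
    return {'loss': loss, 'status': STATUS_OK}
```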
The Search Space
Hyperopt needs a search space from which to sample and select hyperparameters. The search space will be different for each algorithm that you work with. Here is our search space for SVC, which captures most of the main hyperparameters. Note that you can add more parameters to this if you wish.
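Below is a sketch of what this space can look like. The specific ranges are assumptions; the labels (C, x_kernel, x_degree, x_probability) are chosen to match the parameter names that appear in the results further down:

```python
import numpy as np
from hyperopt import hp

# Hypothetical SVC search space; the ranges are illustrative assumptions.
space = {
    'C': hp.choice('C', list(np.arange(0.005, 1.0, 0.005))),
    'kernel': hp.choice('x_kernel', ['linear', 'rbf', 'poly']),
    'degree': hp.choice('x_degree', [2, 3, 4]),
    'probability': hp.choice('x_probability', [True]),
}
```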
The fmin function
The last piece of the equation is Hyperopt’s fmin function, which takes the following arguments (a minimal example call is sketched after this list):
- Our objective function, which produces the value that Hyperopt attempts to minimize.
- Our search space, from which hyperparameter samples are drawn.
- algo: denotes the algorithm used to build the Bayesian model (here, TPE via tpe.suggest).
- max_evals: the number of sequential iterations to run (the number of search-space samples to test).
- trials: a Trials() object, an interesting feature because it allows you to store the progress of the Bayesian search and then pick up where you left off at a later time.
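A minimal sketch of the call, assuming the objective and space defined above:

```python
import time
from hyperopt import Trials, fmin, tpe

trials = Trials()

start = time.time()
best = fmin(
    fn=objective,      # the function Hyperopt minimizes
    space=space,       # the search space defined above
    algo=tpe.suggest,  # Tree-structured Parzen Estimator
    max_evals=25,      # number of search-space samples to test
    trials=trials,     # stores the full history of the search
)
print(f"Hyperopt search took {time.time() - start:.2f} seconds for 25 candidates")
# Display the best score as neg_log_loss for comparison with the earlier searches.
print(-trials.best_trial['result']['loss'], best)
```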
Hyperopt search took 1.89 seconds for 25 candidates
-0.2869846539767595 {'C': 193, 'x_degree': 1, 'x_kernel': 2, 'x_probability': 0}
Saving the Trials() Object
The Trials() object can be stored using pickle and then reloaded later like this:
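A minimal sketch (the file name is an assumption):

```python
import pickle

# Save the search history to disk.
with open('hyperopt_trials.pkl', 'wb') as f:
    pickle.dump(trials, f)

# Later: reload it and pass it back into fmin (with a larger max_evals)
# to continue the search where it left off.
with open('hyperopt_trials.pkl', 'rb') as f:
    trials = pickle.load(f)
```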
Note that hyperopt was only permitted to run 25 trials, and it found a better score than both GridSearchCV and RandomizedSearchCV, which each used 200 trials.
Now that we have an introduction to Hyperopt, let’s do another example - this time using XGBoost.
Using Hyperopt to tune XGBoost
Let’s use the same toy dataset and see if we can get XGBoost to beat our baseline score of -0.28698 achieved previously.
Our code is going to look like this - these pieces should be familiar to you by now!
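Here is a sketch of how those pieces can fit together for XGBoost. The objective mirrors the SVC version, and the search-space ranges and x_-prefixed labels are assumptions chosen to line up with the output shown below:

```python
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective_xgb(params):
    """Cross-validated log loss for one XGBoost configuration."""
    clf = XGBClassifier(**params)
    loss = -cross_val_score(clf, X, y, scoring='neg_log_loss', cv=3).mean()
    return {'loss': loss, 'status': STATUS_OK}

# Hypothetical XGBoost search space; the ranges are illustrative assumptions.
space_xgb = {
    'n_estimators': hp.choice('x_n_estimators', [50, 100, 200, 300, 400, 500, 750]),
    'max_depth': hp.choice('x_max_depth', [2, 3, 4, 5, 6, 8, 10]),
    'learning_rate': hp.choice('x_learning_rate', list(np.arange(0.01, 0.31, 0.02))),
    'min_child_weight': hp.choice('x_min_child_weight', list(range(1, 13))),
    'subsample': hp.choice('x_subsample', list(np.arange(0.4, 1.01, 0.05))),
    'colsample_bytree': hp.choice('x_colsample_bytree', list(np.arange(0.4, 1.01, 0.05))),
    'colsample_bylevel': hp.choice('x_colsample_bylevel', list(np.arange(0.4, 1.01, 0.05))),
}

trials = Trials()
best = fmin(fn=objective_xgb, space=space_xgb, algo=tpe.suggest,
            max_evals=200, trials=trials)
# Display the best score as neg_log_loss for comparison with the earlier baselines.
print('Best score:', -trials.best_trial['result']['loss'])
print('Best space:', best)
```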
Hyperopt search took 28.61 seconds for 200 candidates
Best score: -0.21250464306833847
Best space: {'x_colsample_bylevel': 11, 'x_colsample_bytree': 5, 'x_learning_rate': 13, 'x_max_depth': 1, 'x_min_child_weight': 8, 'x_n_estimators': 6, 'x_subsample': 11}
We can see that given 200 trials, Hyperopt was able to get XGBoost to produce a score that outperformed our previous baseline.
Visualizing Hyperopt’s Search Pattern
Next we’re going to modify our function a little bit to capture the history of the scores versus time so we can get a visual of what Hyperopt is doing.
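One way to do this, sketched below, is to wrap the objective so it records each loss as it is computed and then plot the series with matplotlib. The loss_history list and wrapper name are hypothetical, and the XGBoost objective and space from the previous sketch are assumed:

```python
import time

import matplotlib.pyplot as plt
from hyperopt import Trials, fmin, tpe

loss_history = []   # hypothetical containers for the search history
time_history = []

def objective_with_history(params):
    """Wrap the XGBoost objective and record each loss as it is computed."""
    result = objective_xgb(params)   # objective from the previous sketch
    loss_history.append(result['loss'])
    time_history.append(time.time())
    return result

best = fmin(fn=objective_with_history, space=space_xgb, algo=tpe.suggest,
            max_evals=200, trials=Trials())

# Plot the loss at each iteration to visualize the search pattern.
plt.plot(loss_history)
plt.xlabel('Iteration')
plt.ylabel('Cross-validated log loss')
plt.title('Hyperopt search pattern')
plt.show()
```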
The search pattern of Hyperopt is interesting. We can see that as the iterations progress, the algorithm tries new permutations of the hyperparameters and then quickly converges back toward a minimum.
We can also plot a histogram of these results to see where the scores cluster.
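A sketch of the histogram, reusing the hypothetical loss_history list from above:

```python
import matplotlib.pyplot as plt

# Histogram of the cross-validated losses seen during the search.
plt.hist(loss_history, bins=30)
plt.xlabel('Cross-validated log loss')
plt.ylabel('Number of trials')
plt.title('Distribution of Hyperopt trial scores')
plt.show()
```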
Summary
- Compared to GridSearchCV and RandomizedSearchCV, Bayesian Optimization is a superior tuning approach that produces better results in less time.
- With hyperopt, the trial history can be saved and the training process continued later by reloading the Trials() object.
- Hyperopt requires the creation of a custom search space and objective function.
- Bayesian optimization is an essential tool for any machine learning engineer or data scientist!
I hope you enjoyed this article!